We show how the shared information between the past and future, the excess entropy, derives

INTRODUCTION

Predicting and modeling a system are distinct, but intimately related, goals.
Leveraging past observations, prediction attempts to make correct statements
about what the future will bring, whereas modeling attempts to express the
mechanisms behind the observations. In this view, building a model from
observations is tantamount to decrypting a system's hidden organization. The
cryptographic view rests on the result that the apparent information shared
between past and future (the excess entropy, which sets the bar for
prediction) is only a function of the hidden stored information (the
statistical complexity) [1].

The excess entropy, and related mutual information quantities, though, are
widely used diagnostics for complex systems, having been applied to detect the
presence of organization in dynamical systems [2-5], in spin systems E], in
neurobiological systems H], and even in human language [10, 11].

For the first time, Ref. [1] connected the observed sequence-based measure,
the excess entropy, to a system's internal structure and information
processing. One consequence of the connection, and so of our ability to
differentiate between them, is that the excess entropy is an inadequate
measure of a process's organization. One must build models.

Our intention here is rather prosaic, however. We provide a focused and
detailed proof of this relationship, which appears as Thm. 1 in Ref. [1] in a
necessarily abbreviated form. A proof also appears in Ref. [12], employing a
set of manipulations, developed but not laid out explicitly there, that
require some facility with four-variable mutual informations and with subtle
limiting properties of stochastic processes. The result is that directly
expanding either of these concise proofs, without first deriving the rules,
leads to apparent ambiguities.

The goal in the following is to present a step-by-step proof, motivating and
explaining each step and attendant difficulties. The development also allows
us to emphasize several new results that clarify the challenges in
analytically calculating and empirically estimating these quantities. To get
started, we give a minimal summary of the required background, assuming
familiarity with Refs. [1] and [12], information theory [13], and information
measures [14].

BACKGROUND 

A process $\Pr(\overleftarrow{X}, \overrightarrow{X})$ is a communication
channel with a fixed input distribution $\Pr(\overleftarrow{X})$: It transmits
information from the past $\overleftarrow{X} = \ldots X_{-3} X_{-2} X_{-1}$ to
the future $\overrightarrow{X} = X_0 X_1 X_2 \ldots$ by storing it in the
present. $X_t$ denotes the discrete random variable at time $t$ taking on
values from an alphabet $\mathcal{A}$. A prediction of the process is
specified by a distribution $\Pr(\overrightarrow{X} | \overleftarrow{x})$ of
possible futures given a particular past $\overleftarrow{x}$. At a minimum, a
good predictor, call it $\mathcal{R}$, must capture all of a process's excess
entropy [15], the information $I$ shared between past and future:
$E = I[\overleftarrow{X}; \overrightarrow{X}]$. That is, for a good predictor:

$$ E = I[\mathcal{R}; \overrightarrow{X}] . $$

Building a model of a process is more demanding than developing a prediction
scheme, though, as one wishes to express a process's mechanisms and internal
organization. To do this, computational mechanics introduced an equivalence
relation $\overleftarrow{x} \sim \overleftarrow{x}'$ that groups all histories
which give rise to the same prediction. The result is a map
$\epsilon : \overleftarrow{\mathbf{X}} \to \mathcal{S}$ from pasts to causal
states defined by:

$$ \epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}' : \Pr(\overrightarrow{X} | \overleftarrow{x}) = \Pr(\overrightarrow{X} | \overleftarrow{x}') \} . \tag{1} $$

In other words, a process's causal states are equivalence classes,
$\mathcal{S} = \Pr(\overleftarrow{X}, \overrightarrow{X}) / \sim$, that
partition the space $\overleftarrow{\mathbf{X}}$ of pasts into sets which are
predictively equivalent. The resulting model, consisting of the causal states
and transitions, is called the process's ε-machine [TB]. Out of all optimally
predictive models $\mathcal{R}$ resulting from a partition of the past, the
ε-machine captures the minimal amount of information that a process must
store: the statistical complexity $C_\mu = H[\mathcal{S}]$.

Said simply, $E$ is the effective information transmission rate of the
process, viewed as a channel, and $C_\mu$ is the sophistication of that
channel. In general, the explicitly observed information $E$ is only a lower
bound on the information $C_\mu$ that a process stores [16].

The original development of ε-machines concerned using the past to predict
the future. One can, of course, use the future to retrodict the past by
scanning the measurement variables in the reverse-time direction, as opposed
to the default forward-time direction. With this in mind, the original map
$\epsilon(\cdot)$ from pasts to causal states is denoted $\epsilon^+$ and it
gives what are called the predictive causal states $\mathcal{S}^+$. When
scanning in the reverse direction, we have a new equivalence relation,
$\overrightarrow{x} \sim^- \overrightarrow{x}'$, that groups futures which are
equivalent for the purpose of retrodicting the past:
$\epsilon^-(\overrightarrow{x}) = \{ \overrightarrow{x}' : \Pr(\overleftarrow{X} | \overrightarrow{x}) = \Pr(\overleftarrow{X} | \overrightarrow{x}') \}$.
It gives the retrodictive causal states
$\mathcal{S}^- = \Pr(\overleftarrow{X}, \overrightarrow{X}) / \sim^-$.

In this bidirectional setting we have the forward-scan ε-machine $M^+$ and
its reverse-scan ε-machine $M^-$. From them we can calculate corresponding
entropy rates, $h_\mu^+$ and $h_\mu^-$, and statistical complexities,
$C_\mu^+ = H[\mathcal{S}^+]$ and $C_\mu^- = H[\mathcal{S}^-]$, respectively.
Notably, while a stationary process is equally predictable in both directions
of time, $h_\mu^+ = h_\mu^-$, the amount of stored information differs in
general: $C_\mu^+ \neq C_\mu^-$ [1].

Recall that Thm. 1 of Ref. [1] showed that the shared information between the
past $\overleftarrow{X}$ and future $\overrightarrow{X}$ is the mutual
information between the predictive ($M^+$'s) and retrodictive ($M^-$'s) causal
states:

$$ E = I[\mathcal{S}^+; \mathcal{S}^-] . \tag{2} $$

This led to the view that the process's channel utilization
$E = I[\overleftarrow{X}; \overrightarrow{X}]$ is the same as that,
$I[\mathcal{S}^+; \mathcal{S}^-]$, in the channel between a process's forward
and reverse causal states.

To understand how the states of the forward and reverse ε-machines capture
information from the past and the future, and to avoid the ambiguities alluded
to earlier, we must analyze a four-variable mutual information:
$I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]$. A
large number of expansions of this quantity are possible. A systematic
development follows from Ref. [14], which showed that Shannon entropy
$H[\cdot]$ and mutual information $I[\cdot\,;\cdot]$ form a signed measure
over the space of events.
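For orientation, the multivariate mutual informations used in what follows can be unpacked recursively. The block below is a sketch of the standard recursion for such quantities; the placeholder variables $A$, $B$, $C$, $D$ are ours and stand for any of the past, the future, and the two causal-state variables.

```latex
\begin{align*}
I[A;B;C]   &= I[A;B] - I[A;B|C] , \\
I[A;B;C;D] &= I[A;B;C] - I[A;B;C|D] , \\
I[A;B;C|D] &= I[A;B|D] - I[A;B|C,D] .
\end{align*}
```

Because the measure is signed, three- and four-variable quantities of this kind can be negative, unlike an ordinary two-variable mutual information.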

TWO ISSUES 

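The theorem's proof can be expressed in a very compact way using several
(implied) rules:

\begin{align*}
E &= I[\overleftarrow{X}; \overrightarrow{X}] \tag{3} \\
  &= I[\epsilon^+(\overleftarrow{X}); \epsilon^-(\overrightarrow{X})] \tag{4} \\
  &= I[\mathcal{S}^+; \mathcal{S}^-] . \tag{5}
\end{align*}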



While this proof conveys the essential meaning and, being short, is easily
intuited, there are two issues with it. The concern is that, if the concise
proof is misinterpreted or the rules not heeded, confusion arises. Refs. [1]
and [12] develop the appropriate rules, but do not lay them out explicitly.

The first issue is that naive expansion of the past-future mutual
informations leads to ambiguously interpretable quantities. The second issue
is that implicitly there are Shannon entropies, e.g., $H[\overleftarrow{X}]$
and $H[\overrightarrow{X}]$, over semi-infinite chains of random variables and
these entropies diverge in the general case. Here, via an exegesis of the
concise proof, we show how to address these two problems and, along the way,
explicate several of the required rules. We diagnose the first issue and then
provide a new step-by-step proof, ignoring the second issue of divergent
quantities. We end by showing how to work systematically with divergent
entropies.

COMPARABLE OBJECTS AND SUFFICIENCY 

The first problem comes from inappropriate application of the
$\epsilon(\cdot)$ functions. The result is the inadvertent introduction of
incomparable quantities. Namely,

\begin{align*}
\Pr(\overrightarrow{X} | \overleftarrow{X})
  &= \Pr(\overrightarrow{X} | \epsilon^+(\overleftarrow{X})) \\
  &= \Pr(\overrightarrow{X} | \mathcal{S}^+) \tag{6}
\end{align*}

is a proper use of the predictive causal equivalence relation: the
probabilities at each stage refer to the same object, the future
$\overrightarrow{X}$. We say that the predictive causal states are sufficient
statistics for the future.

The following use (in the first equality) is incorrect, however:

\begin{align*}
\Pr(\overleftarrow{X}) &= \Pr(\epsilon^+(\overleftarrow{X})) \\
                       &= \Pr(\mathcal{S}^+) ,
\end{align*}

even though it appears as a straightforward (and analogous) application of the
causal equivalence relation. The problem occurs since the first equality
incorrectly conflates the probabilities of two different objects: a past and a
causal state; an element and a set. A handy mnemonic for spotting this error
is to interpret the expression literally: typically, a causal state has
positive probability, but an infinite past has zero probability. The literal
statement is clearly wrong.

There are restrictions on when the causal equivalence relation can be
applied. In particular, in the shorthand proof of Thm. 1 above, there are
ambiguous expansions of the mutual information that lead one to such errors.
These must be avoided.

Specifically, the step (Eq. (4)) involving the simultaneous application of the
forward and reverse causal equivalence relations must be done with care. Here,
we show how to do this. But, first, let's explore the problem a bit more.
Starting from Eq. (3), we go one step at a time:



This seems fine, since no overtly infinite quantities appear and
$\epsilon^-(\cdot)$ is used only in conditioning.



(7) 



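The result is correct, largely because one has in mind the more detailed
series of steps using the mutual information's component entropies. That is,
let's redo the preceding:

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X}]
  &= H[\overrightarrow{X}] - H[\overrightarrow{X} | \overleftarrow{X}] \tag{8} \\
  &= H[\overrightarrow{X}] - H[\overrightarrow{X} | \epsilon^+(\overleftarrow{X})] \\
  &= H[\overrightarrow{X}] - H[\overrightarrow{X} | \mathcal{S}^+] . \tag{9}
\end{align*}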



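Notice that the application of $\epsilon^+(\cdot)$ occurs only in
conditioning. Also, for the sake of argument, we temporarily ignore the
appearance of the potentially infinite quantity $H[\overrightarrow{X}]$.

To emphasize the point, it is incorrect to continue the same strategy,
however. That is, picking up from Eq. (9), the following is ambiguous:

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X}]
  &= I[\mathcal{S}^+; \overrightarrow{X}] \\
  &= I[\mathcal{S}^+; \epsilon^-(\overrightarrow{X})] \\
  &= I[\mathcal{S}^+; \mathcal{S}^-] ,
\end{align*}

even though the final line is the goal and, ultimately, is correct. Why? To
see this, we again expand out the intermediary steps implied:

\begin{align*}
I[\mathcal{S}^+; \overrightarrow{X}]
  &= H[\overrightarrow{X}] - H[\overrightarrow{X} | \mathcal{S}^+] \tag{10} \\
  &= H[\epsilon^-(\overrightarrow{X})] - H[\epsilon^-(\overrightarrow{X}) | \mathcal{S}^+] \tag{11} \\
  &= H[\mathcal{S}^-] - H[\mathcal{S}^- | \mathcal{S}^+] \\
  &= I[\mathcal{S}^-; \mathcal{S}^+] .
\end{align*}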



That second step (Eq. (11)), by violating the rule of matching object types,
is wrong. And so, the ensuing steps do not follow, even if the desired result
is obtained.

The conclusion is that the second use of the causal equivalence relation,
seemingly forced in the original short proof of Thm. 1, is not valid. The
solution is to find a different proof strategy that does not lead to this
cul-de-sac.

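There is an alternative expansion to Eq. (10) that appears to avoid the
problem:

\begin{align*}
I[\mathcal{S}^+; \overrightarrow{X}]
  &= H[\mathcal{S}^+] - H[\mathcal{S}^+ | \overrightarrow{X}] \\
  &= H[\mathcal{S}^+] - H[\mathcal{S}^+ | \epsilon^-(\overrightarrow{X})] \tag{12} \\
  &= H[\mathcal{S}^+] - H[\mathcal{S}^+ | \mathcal{S}^-] \\
  &= I[\mathcal{S}^-; \mathcal{S}^+] .
\end{align*}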



The step to Eq. (12) is still problematic, though. The concern is that, on
the one hand, the retrodictive causal states are sufficient for the pasts, as
indicated in Eq. (6). On the other hand, it does not immediately follow that
they are sufficient for predictive causal states, as required by Eq. (12).



In short, these problems result from ignoring that the 
goal involves a higher-dimensional, multivariate problem. 
We need a strategy that avoids the ambiguities and gives 
a reliable procedure. This is found in using the four- 
variable mutual informations introduced in Refs. [1] and 
[12] . This is the strategy we now lay out and it also serves 
to illustrate the rules required for the more concise proof 
strategy. 



DETAILED PROOF 

In addition to the rule of not introducing incomparable objects, we need
several basic results. First, the causal equivalence relations lead to the
informational identities:

\begin{align*}
H[\mathcal{S}^+ | \overleftarrow{X}] &= 0 , \\
H[\mathcal{S}^- | \overrightarrow{X}] &= 0 .
\end{align*}

That is, these state uncertainties vanish, since $\epsilon^+(\cdot)$ and
$\epsilon^-(\cdot)$ are functions, respectively, of the past and future.

Second, causal states have the Markovian property that they render the past
and future statistically independent. They causally shield the future from
the past:

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}^+] &= 0 , \\
I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}^-] &= 0 .
\end{align*}

In this way, one sees how the causal states are the structural decomposition
of a process into conditionally independent modules. Moreover, they are
defined to be optimally predictive in the sense that knowing which causal
state a process is in is just as good as having the entire past in hand:
$\Pr(\overrightarrow{X} | \mathcal{S}^+) = \Pr(\overrightarrow{X} | \overleftarrow{X})$
or, equivalently,

$$ E = I[\mathcal{S}^+; \overrightarrow{X}] . $$

Now, we consider several additional identities that follow more or less
straightforwardly from the ε-machine's defining properties.

Lemma 1. $I[\mathcal{S}^+; \mathcal{S}^- | \overleftarrow{X}] = 0$ and
$I[\mathcal{S}^+; \mathcal{S}^- | \overrightarrow{X}] = 0$.

Proof. These vanish since the past (future) determines the predictive
(retrodictive) causal state. □
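As a worked illustration of that one-line argument, the two conditional mutual informations can be expanded into entropies. This is only a sketch of the implied step; it uses the identities $H[\mathcal{S}^+ | \overleftarrow{X}] = 0$ and $H[\mathcal{S}^- | \overrightarrow{X}] = 0$ stated earlier, together with the standard fact that additional conditioning cannot increase entropy.

```latex
\begin{align*}
I[\mathcal{S}^+; \mathcal{S}^- | \overleftarrow{X}]
  &= H[\mathcal{S}^+ | \overleftarrow{X}]
   - H[\mathcal{S}^+ | \mathcal{S}^-, \overleftarrow{X}]
   = 0 - 0 , \\
I[\mathcal{S}^+; \mathcal{S}^- | \overrightarrow{X}]
  &= H[\mathcal{S}^- | \overrightarrow{X}]
   - H[\mathcal{S}^- | \mathcal{S}^+, \overrightarrow{X}]
   = 0 - 0 .
\end{align*}
```

In each line the second entropy is bounded above by the first, which is zero, so both terms vanish.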

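Lemma 2. $I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+ | \mathcal{S}^-] = 0$.

Proof.

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+ | \mathcal{S}^-]
  &= I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}^-]
   - I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}^+, \mathcal{S}^-] \\
  &= 0 - 0 .
\end{align*}

The terms vanish by causal shielding. □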

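Lemma 3. $I[\mathcal{S}^+; \mathcal{S}^-; \overrightarrow{X} | \overleftarrow{X}] = 0$.

Proof.

\begin{align*}
I[\mathcal{S}^+; \mathcal{S}^-; \overrightarrow{X} | \overleftarrow{X}]
  = I[\mathcal{S}^+; \mathcal{S}^- | \overleftarrow{X}]
  - I[\mathcal{S}^+; \mathcal{S}^- | \overleftarrow{X}, \overrightarrow{X}] .
\end{align*}

The first term vanishes by Lemma 1. Expanding the second term we see that:

\begin{align*}
I[\mathcal{S}^+; \mathcal{S}^- | \overleftarrow{X}, \overrightarrow{X}]
  = H[\mathcal{S}^+ | \overleftarrow{X}, \overrightarrow{X}]
  - H[\mathcal{S}^+ | \mathcal{S}^-, \overleftarrow{X}, \overrightarrow{X}] .
\end{align*}

Both terms here vanish since the past determines the predictive causal
state. □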

Now, we are ready for the proof. First, recall the the- 
orem's statement. 

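Theorem 1. Excess entropy is the mutual information between the predictive
and retrodictive causal states:

$$ E = I[\mathcal{S}^+; \mathcal{S}^-] . \tag{13} $$

Proof. This follows via a parallel reduction of the four-variable mutual
information
$I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]$ into
$I[\overleftarrow{X}; \overrightarrow{X}]$ and
$I[\mathcal{S}^+; \mathcal{S}^-]$. The first reduction is:

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]
  &= I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+]
   - I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+ | \mathcal{S}^-] \\
  &= I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+] \\
  &= I[\overleftarrow{X}; \overrightarrow{X}]
   - I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}^+] \\
  &= I[\overleftarrow{X}; \overrightarrow{X}] \\
  &= E .
\end{align*}

The second line follows from Lemma 2 and the fourth from causal shielding.

The second reduction is, then:

\begin{align*}
I[\overleftarrow{X}; \overrightarrow{X}; \mathcal{S}^+; \mathcal{S}^-]
  &= I[\mathcal{S}^+; \mathcal{S}^-; \overrightarrow{X}]
   - I[\mathcal{S}^+; \mathcal{S}^-; \overrightarrow{X} | \overleftarrow{X}] \\
  &= I[\mathcal{S}^+; \mathcal{S}^-]
   - I[\mathcal{S}^+; \mathcal{S}^- | \overrightarrow{X}] \\
  &= I[\mathcal{S}^+; \mathcal{S}^-] .
\end{align*}

The second line follows from Lemma 3, which removes the conditional
three-variable term, and the third from Lemma 1. □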

Remark. Note that the steps here do not force one into 
inadvertently using the causal equivalence relation to in- 
troduce incomparable objects. 



FINITE PASTS AND FUTURES 

This is all well and good, but there is a nagging concern 
in all of the above. As noted at the beginning, we are 
improperly using entropies of semi-infinite chains of ran- 
dom variables. These entropies typically are infinite and 
so many of the steps are literally not correct. Fortunately, 
as we will show, this concern is so directly addressed that 
there is rarely an inhibition in the above uses. The short- 
cuts that allow their use are extremely handy and allow 
much progress and insight, if deployed with care. Ulti- 
mately, of course, one must still go through proofs using 
proper objects and manipulations and verifying limits. 
We now show how to address this issue, highlighting a 
number of technicalities that distinguish between impor- 
tant process classes. 

The strategy is straightforward, if somewhat tedious 
and obfuscating: Define pasts, futures, and causal states 
over finite-length sequences. 

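Definition. Given a process $\Pr(\overleftarrow{X}, \overrightarrow{X})$, its
finite predictive causal states $\mathcal{S}^+_{K,L}$ are defined by:

$$ \epsilon^+_{K,L}(\overleftarrow{x}^K) = \{ \overleftarrow{x}'^K : \Pr(\overrightarrow{X}^L | \overleftarrow{x}^K) = \Pr(\overrightarrow{X}^L | \overleftarrow{x}'^K) \} . $$

Definition. Given a process $\Pr(\overleftarrow{X}, \overrightarrow{X})$, its
finite retrodictive causal states $\mathcal{S}^-_{K,L}$ are defined by:

$$ \epsilon^-_{K,L}(\overrightarrow{x}^L) = \{ \overrightarrow{x}'^L : \Pr(\overleftarrow{X}^K | \overrightarrow{x}^L) = \Pr(\overleftarrow{X}^K | \overrightarrow{x}'^L) \} . $$

That is, we now partition finite pasts (futures) of length $K$ ($L$) with
probabilistically distinct distributions over finite futures (pasts). We end
up with two sets, $\mathcal{S}^+_{K,L}$ and $\mathcal{S}^-_{K,L}$, which
describe the finite-length predictive and retrodictive causal states for each
value of $K$ and $L$.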
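To make the finite-length construction concrete, the following Python sketch partitions length-K pasts by the distribution they induce over length-L futures. The example process (the golden-mean process, a binary Markov chain with no two consecutive 1s) and the names `PI`, `T`, `word_prob`, and `finite_predictive_states` are our assumptions for illustration; none of them comes from the text.

```python
from itertools import product

# ASSUMPTION: the golden-mean process is a hypothetical example chosen for
# illustration only. It is an order-1 Markov chain over {0, 1}.
PI = {0: 2 / 3, 1: 1 / 3}                        # stationary symbol distribution
T = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}   # T[a][b] = Pr(next = b | current = a)


def word_prob(word):
    """Stationary probability of a non-empty finite word."""
    p = PI[word[0]]
    for a, b in zip(word, word[1:]):
        p *= T[a][b]
    return p


def finite_predictive_states(K, L):
    """Group length-K pasts that give the same distribution over length-L futures."""
    classes = {}
    for past in product((0, 1), repeat=K):
        p_past = word_prob(past)
        if p_past == 0:
            continue  # forbidden pasts belong to no state
        # Pr(future | past) for every length-L future, rounded so that
        # floating-point noise does not split an equivalence class.
        cond = tuple(
            round(word_prob(past + future) / p_past, 12)
            for future in product((0, 1), repeat=L)
        )
        classes.setdefault(cond, []).append(past)
    return list(classes.values())


states = finite_predictive_states(K=3, L=2)
for members in states:
    print(members)
```

For this particular example the pasts fall into two classes, distinguished by their final symbol. The retrodictive states could be built the same way by conditioning pasts on futures.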

Remark. The subscripts on $\mathcal{S}^+_{K,L}$ and $\mathcal{S}^-_{K,L}$
should not be interpreted as time indices, as they are more commonly used in
the literature.

Remark. A central issue here is that, in general, for the causal states
$\mathcal{S}^+$ defined by Eq. (1):

$$ \mathcal{S}^+ \neq \lim_{K,L \to \infty} \mathcal{S}^+_{K,L} . \tag{14} $$

The analogous situation is true for $\mathcal{S}^-$. Why? For some processes,
it can happen that $|\mathcal{S}^+_{K,L}| \to \infty$ even though
$|\mathcal{S}^+| < \infty$. The result is that the causal states
$\mathcal{S}^+$ are not reached in the above limiting procedure. However,
their information content can be the same. And so, in the following, we must
take care in establishing results regarding the large-$K$ and large-$L$
limits.

A first example of this is to explain why the appli- 
cations of e(-) in Eqs. ^ and ^ are plausible. We 
establish the finite-length version of those steps. 
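Proposition 1. $H[\overrightarrow{X}^L | \overleftarrow{X}^K] = H[\overrightarrow{X}^L | \mathcal{S}^+_{K,L}]$.

Proof. We calculate directly:

\begin{align*}
H[\overrightarrow{X}^L | \overleftarrow{X}^K]
  &= \sum_{w \in \mathcal{A}^K} \Pr(w) \, H[\overrightarrow{X}^L | \overleftarrow{X}^K = w] \\
  &= \sum_{w \in \mathcal{A}^K} \Pr(w) \, H[\overrightarrow{X}^L | \mathcal{S}^+_{K,L} = \epsilon^+_{K,L}(w)] \\
  &= \sum_{\sigma \in \mathcal{S}^+_{K,L}} H[\overrightarrow{X}^L | \mathcal{S}^+_{K,L} = \sigma] \, \Pr(\sigma) \\
  &= H[\overrightarrow{X}^L | \mathcal{S}^+_{K,L}] . \qquad \Box
\end{align*}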

And so, for all L, we have: 

H& L \S+ L ]=lhn H[1 L \S+ L ] 
= km H[1 L \X K ] 



The last step requires a measure-theoretic justification. 
This is given using the method of Ref . jTTJ Appendix] . 

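Corollary 1. $I[\overleftarrow{X}^K; \overrightarrow{X}^L] = I[\mathcal{S}^+_{K,L}; \overrightarrow{X}^L]$.

Proof. Following a finite-length version of Eq. (9), we apply Prop. 1. □

By similar reasoning in the proposition and corollary we have the
time-reversed analogs:

\begin{align*}
H[\overleftarrow{X}^K | \overrightarrow{X}^L] &= H[\overleftarrow{X}^K | \mathcal{S}^-_{K,L}] , \\
I[\overleftarrow{X}^K; \overrightarrow{X}^L] &= I[\overleftarrow{X}^K; \mathcal{S}^-_{K,L}] .
\end{align*}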



Definition. The finite-length excess entropy is:

$$ E(K, L) = I[\overleftarrow{X}^K; \overrightarrow{X}^L] . $$
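As a numerical illustration of this definition, the sketch below evaluates the finite-length excess entropy from block entropies, using the identity I[A;B] = H[A] + H[B] - H[A,B] and stationarity. The golden-mean process and the helper names (`PI`, `T`, `word_prob`, `block_entropy`, `finite_excess_entropy`) are assumptions introduced for the example, not part of the text.

```python
from itertools import product
from math import log2

# ASSUMPTION: the golden-mean process (no two consecutive 1s) is a hypothetical
# example; it is an order-1 Markov chain over {0, 1}.
PI = {0: 2 / 3, 1: 1 / 3}                        # stationary symbol distribution
T = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}   # T[a][b] = Pr(next = b | current = a)


def word_prob(word):
    """Stationary probability of a non-empty finite word."""
    p = PI[word[0]]
    for a, b in zip(word, word[1:]):
        p *= T[a][b]
    return p


def block_entropy(n):
    """Shannon entropy (bits) of length-n words; zero for n = 0."""
    if n == 0:
        return 0.0
    total = 0.0
    for word in product((0, 1), repeat=n):
        p = word_prob(word)
        if p > 0:
            total -= p * log2(p)
    return total


def finite_excess_entropy(K, L):
    """E(K, L) = H[past block] + H[future block] - H[joint block]."""
    # By stationarity, a length-K past and a length-K word anywhere in the
    # sequence have the same entropy, and likewise for the future.
    return block_entropy(K) + block_entropy(L) - block_entropy(K + L)


for K, L in [(1, 1), (2, 3), (4, 3)]:
    print(K, L, round(finite_excess_entropy(K, L), 6))
```

Because this example process is order-1 Markov, E(K, L) already equals its limiting value at K = L = 1; processes with longer memory approach the limit only as K and L grow.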

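Lemma 4. $E = \lim_{K,L \to \infty} E(K, L)$.

Proof. It is known that $I[\overleftarrow{X}^L; \overrightarrow{X}^L]$
converges to $E$ [15, 18]. Thus, it follows straightforwardly that
$I[\overleftarrow{X}^K; \overrightarrow{X}^L]$ also converges to $E$, so long
as $K$ and $L$ simultaneously diverge to infinity. □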

We are now, finally, ready to focus in more directly on 
the original goal. 

Proposition 2. $E(K, L) = I[\mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}]$.

Proof. The proof relies on finite-length analogs to Lemmas 1, 2, and 3 and
then proceeds similarly to Thm. 1. Specifically,

$$ I[\overleftarrow{X}^K; \mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}; \overrightarrow{X}^L] = I[\overleftarrow{X}^K; \overrightarrow{X}^L] $$

follows from the first reduction in the proof of Thm. 1 and:

$$ I[\overleftarrow{X}^K; \mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}; \overrightarrow{X}^L] = I[\mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}] $$

follows from the second reduction there. All that is changed in the
reductions is the substitution of finite-length quantities. Otherwise, the
information-theoretic identities hold as given there. □
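The equality between the block mutual information and the mutual information of the finite causal states can be checked numerically on a small example. The Python sketch below does so for the golden-mean process, which is our hypothetical choice of example; `label_map`, `mutual_information`, and `compare` are illustrative helper names, not functions from any library or from the text.

```python
from itertools import product
from math import log2

# ASSUMPTION: the golden-mean process (no two consecutive 1s) is a hypothetical
# example; it is an order-1 Markov chain over {0, 1}.
PI = {0: 2 / 3, 1: 1 / 3}
T = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}


def word_prob(word):
    """Stationary probability of a non-empty finite word."""
    p = PI[word[0]]
    for a, b in zip(word, word[1:]):
        p *= T[a][b]
    return p


def label_map(K, L, predictive):
    """Map each allowed past (or future) to an integer finite-causal-state label."""
    n_own, n_other = (K, L) if predictive else (L, K)
    labels, seen = {}, {}
    for word in product((0, 1), repeat=n_own):
        p = word_prob(word)
        if p == 0:
            continue
        others = product((0, 1), repeat=n_other)
        if predictive:   # Pr(future | past): the past comes first in time
            cond = tuple(round(word_prob(word + o) / p, 12) for o in others)
        else:            # Pr(past | future): the past still comes first
            cond = tuple(round(word_prob(o + word) / p, 12) for o in others)
        labels[word] = seen.setdefault(cond, len(seen))
    return labels


def mutual_information(joint):
    """I[A;B] in bits from a dict {(a, b): probability} with positive entries."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items())


def compare(K, L):
    """Return (I[past block; future block], I[forward states; reverse states])."""
    fwd, rev = label_map(K, L, True), label_map(K, L, False)
    joint_words, joint_states = {}, {}
    for past in fwd:
        for future in rev:
            p = word_prob(past + future)
            if p == 0:
                continue
            joint_words[(past, future)] = p
            key = (fwd[past], rev[future])
            joint_states[key] = joint_states.get(key, 0.0) + p
    return mutual_information(joint_words), mutual_information(joint_states)


e_words, e_states = compare(K=3, L=2)
print(e_words, e_states)
```

The two printed numbers agree up to floating-point error for this example. The check is an illustration of the proposition on one process, not a substitute for the proof.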

Theorem 2. The excess entropy is:

$$ E = \lim_{K,L \to \infty} I[\mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}] . $$

Proof. By Lemma 4, we relate $E$ to the sequence of mutual informations
between the finite past and finite future. By Prop. 2, this limit is also
equal to the limit of mutual informations between the finite predictive and
finite retrodictive causal states. □

Remark. As with Lemma 4, the limits in $K$ and $L$ must be done
simultaneously.

At this point, we have gone as far as possible, it seems, 
in relating the finite-length excess entropy and forward- 
reverse causal-state mutual informations. From here on, 
different kinds of process have different limiting behav- 
iors. We discuss one such class and so establish the orig- 
inal claim. 

Recall the class of processes that can be represented by exactly
synchronizing ε-machines. Roughly speaking, such a process has an ε-machine
to which an observer comes to know its internal state from a finite number of
measurements. (For background see Ref. [18].) This is the class of processes
we focus on in the following.

Lemma 5. If $M^+$ and $M^-$ are both exactly synchronizing and each has a
finite number of (recurrent) causal states, then:

$$ I[\mathcal{S}^+; \mathcal{S}^-] = \lim_{K,L \to \infty} I[\mathcal{S}^+_{K,L}; \mathcal{S}^-_{K,L}] . \tag{15} $$



Proof. Finitary processes that are exactly synchronizable have at least one
finite-length synchronizing word. And this sync word occurs in almost every
sufficiently long sequence. Thus, as $K$ and $L$ simultaneously tend to
infinity, one eventually constructs a partition that includes a synchronizing
word. From there on, increasing $K$ and $L$ eventually discovers all
infinite-length causal states, which are finite in number by assumption. The
result is that probability accumulates in the subset of finite-length causal
states which correspond to the causal states that are reachable,
infinitely-preceded, and recurrent. Thus, the limit of the finite-length
causal states differs from the infinite-length causal states only on a set of
measure zero. Finally, also by assumption, this holds for both the forward and
reverse ε-machines. And so, the information content in the finite-length
causal states limits on the information content of the causal states which,
by Eq. (1), are defined in terms of semi-infinite pasts and futures. □

Theorem 3. If $M^+$ and $M^-$ are both exactly synchronizing and each has a
finite number of (recurrent) causal states, then:

$$ E = I[\mathcal{S}^+; \mathcal{S}^-] . $$

Proof. Directly from Thm. 2 and Lemma 5. □

CONCLUSION 

In the preceding, we examined an evocative and, in its simplicity,
innocent-looking identity: $E = I[\mathcal{S}^+; \mathcal{S}^-]$. It tells us
that the excess entropy is equal to the mutual information between the
predictive and retrodictive causal states. It begins to reveal its subtleties
when one realizes that excess entropy is defined solely in terms of the
observed process $\Pr(\overleftarrow{X}, \overrightarrow{X})$ and makes no
explicit reference to the process's internal organization. Additionally,
$\overleftarrow{X}$ and $\overrightarrow{X}$ are continuous random variables,
when $\mathcal{S}^+$ and $\mathcal{S}^-$ need not be.

In explicating their relationships, finite-length coun- 
terparts to the predictive and retrodictive causal states 
were introduced, and the limit was taken as the finite- 
lengths tended to infinity. A priori, there is no reason to 
expect that the finite-length causal states will limit on 
the causal states, since the latter are defined over infinite 
histories and futures. In fact, there are finitary processes 
for which the number of finite-length causal states di- 
verges, even when the number of (asymptotic, recurrent) 
causal states is finite. 

However, when considering exactly synchronizing ε-machines, there exists a
subset of the finite-length causal states at each $K$ and $L$ that does limit
on the causal states. When such ε-machines have a finite number of causal
states, it is possible to identify this subset.
This fact was used to prove Thm. [5] 

When this subset of the finite-length causal states can- 
not be identified or when it does not exist, it is still ex- 
pected that the limit of mutual informations between the 
finite-length causal states will equal the mutual informa- 
tion between the predictive and retrodictive causal states. 



However, the proof for this requires more sophistication and the technique
for calculating $E$, outlined in Ref. [12], needs refining. The set of
ε-machines that are not exactly synchronizing are among those that would
benefit from such analysis.

The information diagram of Figure 1 closes our development by summarizing the
more detailed finite-history and finite-future framework introduced here. The
various lemmas, showing that this or that mutual information vanished,
translate into information-measure atoms having zero area. The overall
diagram is quite similar to that introduced in Ref. [1], which serves to
emphasize the point made earlier that working with infinite sequences
preserves many of the central relationships in a process's information
measure. It also does not suffer from the criticism, as did the previous one,
of representing infinite atoms as finite.

The information diagram graphically demonstrates that, as done in the
detailed proof given for Thm. 3, one should avoid using potentially infinite
quantities, such as $H[\overleftarrow{X}]$ and $H[\overrightarrow{X}]$,
whenever possible, in favor of alternative finite atoms, which are various
mutual informations and conditional mutual informations. Moreover, when
infinite atoms cannot be avoided, then the finite-length quantities must be
used and their limits carefully taken, as we showed.




FIG. 1. ε-Machine information diagram over finite length-K past and length-L
future sequences for a stationary stochastic process.